Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks
Distributed transactions on high-overhead TCP/IP-based networks were
conventionally considered to be prohibitively expensive and thus were avoided
at all costs. To that end, the primary goal of almost any existing partitioning
scheme is to minimize the number of cross-partition transactions. However, with
the new generation of fast RDMA-enabled networks, this assumption is no longer
valid. In fact, recent work has shown that distributed databases can scale even
when the majority of transactions are cross-partition. In this paper, we first
make the case that the new bottleneck which hinders truly scalable transaction
processing in modern RDMA-enabled databases is data contention, and that
optimizing for data contention leads to different partitioning layouts than
optimizing for the number of distributed transactions. We then present Chiller,
a new approach to data partitioning and transaction execution, which aims to
minimize data contention for both local and distributed transactions. Finally,
we evaluate Chiller using various workloads, and show that our partitioning and
execution strategy outperforms traditional partitioning techniques which try to
avoid distributed transactions, by up to a factor of 2.
Honeycomb: ordered key-value store acceleration on an FPGA-based SmartNIC
In-memory ordered key-value stores are an important building block in modern
distributed applications. We present Honeycomb, a hybrid software-hardware
system for accelerating read-dominated workloads on ordered key-value stores
that provides linearizability for all operations including scans. Honeycomb
stores a B-Tree in host memory, and executes SCAN and GET on an FPGA-based
SmartNIC, and PUT, UPDATE and DELETE on the CPU. This approach enables large
stores and simplifies the FPGA implementation but raises the challenge of data
access and synchronization across the slow PCIe bus. We describe how Honeycomb
overcomes this challenge with careful data structure design, caching, request
parallelism with out-of-order request execution, wait-free read operations, and
batching synchronization between the CPU and the FPGA. For read-heavy YCSB
workloads, Honeycomb improves the throughput of a state-of-the-art ordered
key-value store by at least 1.8x. For scan-heavy workloads inspired by cloud
storage, Honeycomb improves throughput by more than 2x. The cost-performance,
which is more important for large-scale deployments, is improved by at least
1.5x on these workloads.
Understanding PCIe performance for end host networking
In recent years, spurred on by the development and availability of programmable NICs, end hosts have increasingly become the enforcement point for core network functions such as load balancing, congestion control, and application specific network offloads. However, implementing custom designs on programmable NICs is not easy: many potential bottlenecks can impact performance.
This paper focuses on the performance implications of PCIe, the de facto I/O interconnect in contemporary servers, when interacting with the host architecture and device drivers. We present a theoretical model for PCIe and pcie-bench, an open-source suite, that allows developers to gain an accurate and deep understanding of the PCIe substrate. Using pcie-bench, we characterize the PCIe subsystem in modern servers. We highlight surprising differences in PCIe implementations, evaluate the undesirable impact of PCIe features such as IOMMUs, and show the practical limits for common network cards operating at 40Gb/s and beyond. Furthermore, through pcie-bench we gained insights which guided software and future hardware architectures for both commercial and research-oriented network cards and DMA engines.
I/O Is Faster Than the CPU - Let's Partition Resources and Eliminate (Most) OS Abstractions
Peer reviewed
RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs
Modern online services come with stringent quality requirements in terms of response time tail latency. Because of their decomposition into fine-grained communicating software layers, a single user request fans out into a plethora of short, μs-scale RPCs, aggravating the need for faster inter-server communication. In reaction to that need, we are witnessing a technological transition characterized by the emergence of hardware-terminated user-level protocols (e.g., InfiniBand/RDMA) and new architectures with fully integrated Network Interfaces (NIs). Such architectures offer a unique opportunity for a new NI-driven approach to balancing RPCs among the cores of manycore server CPUs, yielding major tail latency improvements for μs-scale RPCs. We introduce RPCValet, an NI-driven RPC load-balancing design for architectures with hardware-terminated protocols and integrated NIs, that delivers near-optimal tail latency. RPCValet's RPC dispatch decisions emulate the theoretically optimal single-queue system, without incurring synchronization overheads currently associated with single-queue implementations. Our design improves throughput under tight tail latency goals by up to 1.4x, and reduces tail latency before saturation by up to 4x for RPCs with μs-scale service times, as compared to current systems with hardware support for RPC load distribution. RPCValet performs within 15% of the theoretically optimal single-queue system.
Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol
Today's datacenter applications are underpinned by datastores that are
responsible for providing availability, consistency, and performance. For high
availability in the presence of failures, these datastores replicate data
across several nodes. This is accomplished with the help of a reliable
replication protocol that is responsible for maintaining the replicas
strongly-consistent even when faults occur. Strong consistency is preferred to
weaker consistency models that cannot guarantee an intuitive behavior for the
clients. Furthermore, to accommodate high demand at real-time latencies,
datastores must deliver high throughput and low latency.
This work introduces Hermes, a broadcast-based reliable replication protocol
for in-memory datastores that provides both high throughput and low latency by
enabling local reads and fully-concurrent fast writes at all replicas. Hermes
couples logical timestamps with cache-coherence-inspired invalidations to
guarantee linearizability, avoid write serialization at a centralized ordering
point, resolve write conflicts locally at each replica (hence ensuring that
writes never abort) and provide fault-tolerance via replayable writes. Our
implementation of Hermes over an RDMA-enabled reliable datastore with five
replicas shows that Hermes consistently achieves higher throughput than
state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write
ratios while also significantly reducing tail latency. At 5% writes, the tail
latency of Hermes is 3.6X lower than that of CRAQ and ZAB.
Comment: Accepted in ASPLOS 202
Guidelines for the use and interpretation of assays for monitoring autophagy (3rd edition)
In 2008 we published the first set of guidelines for standardizing research in autophagy. Since then, research on this topic has continued to accelerate, and many new scientists have entered the field. Our knowledge base and relevant new technologies have also been expanding. Accordingly, it is important to update these guidelines for monitoring autophagy in different organisms. Various reviews have described the range of assays that have been used for this purpose. Nevertheless, there continues to be confusion regarding acceptable methods to measure autophagy, especially in multicellular eukaryotes. For example, a key point that needs to be emphasized is that there is a difference between measurements that monitor the numbers or volume of autophagic elements (e.g., autophagosomes or autolysosomes) at any stage of the autophagic process versus those that measure flux through the autophagy pathway (i.e., the complete process including the amount and rate of cargo sequestered and degraded). In particular, a block in macroautophagy that results in autophagosome accumulation must be differentiated from stimuli that increase autophagic activity, defined as increased autophagy induction coupled with increased delivery to, and degradation within, lysosomes (in most higher eukaryotes and some protists such as Dictyostelium) or the vacuole (in plants and fungi). In other words, it is especially important that investigators new to the field understand that the appearance of more autophagosomes does not necessarily equate with more autophagy. In fact, in many cases, autophagosomes accumulate because of a block in trafficking to lysosomes without a concomitant change in autophagosome biogenesis, whereas an increase in autolysosomes may reflect a reduction in degradative activity. It is worth emphasizing here that lysosomal digestion is a stage of autophagy and evaluating its competence is a crucial part of the evaluation of autophagic flux, or complete autophagy.
Here, we present a set of guidelines for the selection and interpretation of methods for use by investigators who aim to examine macroautophagy and related processes, as well as for reviewers who need to provide realistic and reasonable critiques of papers that are focused on these processes. These guidelines are not meant to be a formulaic set of rules, because the appropriate assays depend in part on the question being asked and the system being used. In addition, we emphasize that no individual assay is guaranteed to be the most appropriate one in every situation, and we strongly recommend the use of multiple assays to monitor autophagy. Along these lines, because of the potential for pleiotropic effects due to blocking autophagy through genetic manipulation, it is imperative to delete or knock down more than one autophagy-related gene. In addition, some individual Atg proteins, or groups of proteins, are involved in other cellular pathways, so not all Atg proteins can be used as a specific marker for an autophagic process. In these guidelines, we consider these various methods of assessing autophagy and what information can, or cannot, be obtained from them. Finally, by discussing the merits and limits of particular autophagy assays, we hope to encourage technical innovation in the field.
Design Guidelines for High Performance RDMA Systems
Modern RDMA hardware offers the potential for exceptional performance, but design choices including which RDMA operations to use and how to use them significantly affect observed performance. This paper lays out guidelines that can be used by system designers to navigate the RDMA design space. Our guidelines emphasize paying attention to low-level details such as individual PCIe transactions and NIC architecture. We empirically demonstrate how these guidelines can be used to improve the performance of RDMA-based systems: we design a networked sequencer that outperforms an existing design by 50x, and improve the CPU efficiency of a prior high-performance key-value store by 83%. We also present and evaluate several new RDMA optimizations and pitfalls, and discuss how they affect the design of RDMA systems.